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ABSTRACT 

We  all  know  how  to  estimate  a  confidence  interval  for  the  mean  based  on  a  random  sample.  The  interval  is 
centered  on  the  sample  mean,  with  the  half-width  proportional  to  the  sample  standard  error.  We  know  also 
that  terminating  simulations  generate  independent  observations.  What  simulators  appeal-  to  have  overlooked 
is  that  independence  alone  is  insufficient  to  guarantee  a  valid  random  sample — the  observations  must  also  be 
identically  distributed.  This  is  a  good  assumption  if  the  outcome  of  each  replication  is  a  single  observation, 
but  it  is  demonstrably  incorrect  if  the  outcome  is  an  aggregate  value  and  the  replications  have  differing 
numbers  of  observations.  In  this  paper  we  explore  the  implications  of  this  oversight  when  within-replication 
observations  are  independent.  We  then  derive  analytic  results  showing  that  although  the  impact  on  interval 
estimates  can  sometimes  be  negligible,  there  also  are  circumstances  where  the  variance  of  our  estimates 
is  significantly  increased.  We  finish  with  a  simple  example  which  demonstrates  the  potential  impact  for 
practitioners. 

1  INTRODUCTION 

Analysis  of  terminating  simulations ,  i.e.,  models  that  halt  when  they  reach  some  clearly  defined  state, 
seems  straightforward  (Banks  et  al.  2000;  Bratley,  Fox,  and  Schrage  1983;  Hoover  and  Perry  1989;  Law 
and  Kelton  2000).  If  each  run  is  seeded  independently  of  the  others,  then  the  output  measures  from  each 
run  will  be  independent  and  we  can  just  apply  classical  statistics,  right?  It  turns  out  that  the  answer  to  that 
question  may  be  “wrong!”  The  “identically  distributed”  requirement  of  classical  statistics  can  be  called 
into  question  when  the  output  measure  is  an  aggregate  such  as  a  sum  or  sample  mean. 

Performance  measures  of  a  simulation  will  always  be  some  function  of  the  simulation  state.  In  the  case 
of  terminating  simulations,  the  performance  measure  is  reported  upon  termination.  Each  replication  will 
produce  one  observation  of  the  performance  measure.  If  that  observation  directly  represents  an  end  state 
such  as  the  number  of  failed  components  after  a  week’s  operation,  or  the  number  of  patients  processed 
in  24  hours  of  emergency  room  operations,  there’s  no  problem — the  set  of  values  obtained  by  replication 
represent  a  random  sample  from  the  distribution  of  possible  end  states,  and  classical  statistics  applies. 
However,  we  can  run  into  trouble  if  the  performance  measure  is  an  aggregate  measure,  such  as  an  average, 
and  the  number  of  observations  contributing  to  the  aggregate  varies  from  replication  to  replication. 

2  BACKGROUND  AND  NOTATION 

Let  Y,/  be  the  jth  raw  observation  from  the  ith  replication  of  our  model,  with  common  mean  py  and  variance 
Gy .  For  this  paper  we  will  assume  that  we  are  performing  r  independently  seeded  replications  of  our  model. 
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the  ith  replication  of  our  terminating  simulation  produces  a  single  aggregate  measure 


Yi 


Ilk, 

U‘  7=1 


for  i  =  1 . . .  r  based  on  /;,■  observations.  Note  that  /;,  is  explicitly  allowed  to  vary  from  replication  to 
replication.  We  will  denote  the  total  number  of  observations  as  N  =  The  F//  s  are  independent 

across  i  by  virtue  of  independent  seeding,  and  in  this  paper  we  will  assume  that  they  arc  also  independent 
within  a  replication  to  emphasize  the  effect  of  varying  the  sample  sizes.  Given  these  assumptions,  it  should 
be  clear  that  E  [P/]  =  u,  and  Varffi )  =  <T2/n(-.  They  arc  not  identically  distributed,  because  the  variance 
varies  from  run  to  run. 

We  needn’t  worry  about  initial  bias  in  a  terminating  scenario.  The  traditional  wisdom  at  this  point 
would  be  to  calculate  a  grand  sample  mean  and  estimate  the  variance  across  the  replications: 
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and  use  them  to  construct  a  100(1  —  a)  %  confidence  interval  of  the  form 

Y  ±ta/ 2;r-l 

where  ta/ i:r- 1  is  the  critical  value  from  Student’s  T  distribution  with  r  —  1  degrees  of  freedom  and  probability 
a/2  in  each  tail. 


3  IMPACT  OF  UNEQUAL  SAMPLE  SIZES 

So  what  is  the  impact  on  P  of  using  P/’s  with  different  sample  sizes?  It’s  easy  to  see  that  P  is  unbiased: 
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The  variance  of  P  is  similarly  easy  to  derive. 


Unr(P)  =  Var  =  -_Y,Var&) 
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Note  that  when  sample  sizes  arc  identical  equation  (4)  yields  the  familial-  minimum  variance  unbiased 
estimate  (MVUE)  result.  With  equal  sample  sizes  /;,  =N/r  Vi,  and  thus 
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Var(Y )  is  minimized  when  the  sample  sizes  arc  all  equal,  and  maximized  when  r—  1  of  the  Yfs  have  a 
single  observation  and  the  one  remaining  f,  has  the  remaining  N  —  (r—  1)  observations. 

The  preferred  approach,  which  is  well  known  in  classical  statistics  (Kutner  et  al.  2005)  but  has  been 
ignored  or  overlooked  in  simulation  other  than  by  Sanchez  and  White  (2011),  is  to  use  a  weighted  average 
estimator  of  the  form 

K  =  I>i)  s.t  2>  =  1,  Wi>  0  Vi  (5) 

i=l  i=l 

and  the  coiTesponding  unbiased  variance  estimator  is 
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Using  a  convex  weighting  scheme  like  this  results  in  an  unbiased  estimator,  but  the  choice  of  weights 
affects  the  variance  of  the  estimate  of  the  sample  mean.  The  set  of  weights  which  minimizes  the  variance 
of  Yw  is  {  w,  =  tii/N},  in  which  case 


Var(Yw )  =  Var(£wiYi) 
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In  other  words,  with  the  recommended  weights  the  weighted  estimator  is  MVUE  regardless  of  the  varying 
sample  sizes,  while  the  naive  estimator  is  only  MVUE  when  the  sample  sizes  are  equal. 


3.1  Worst  Case  Behavior 


We  can  assess  the  relative  impact  (RI)  of  using  the  naive  estimator  by  looking  at  the  ratio  of  the  variance 
of  Y  to  that  of  Yw: 
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Another  way  of  looking  at  this  is  to  recall  that  N  is  the  sum  of  the  nf  s,  yielding 


RI  = 


(9) 


Thus  the  relative  impact  is  simply  the  average  sample  size  per  replication  times  the  average  of  the  inverse 
sample  sizes. 

It’s  easy  to  confirm  that  RI  is  1  when  the  s  arc  equal,  and  increases  for  any  other  combination  of 
sample  sizes.  The  maximum  variance  sampling  scenario  yields 
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For  example,  if  you  have  ten  observations  total  and  five  replications,  the  worst  case  scenario  is  that  four 
of  the  replications  produce  averages  based  on  a  single  observation  while  the  remaining  replicate  produces 
an  average  of  six  observations.  In  that  case  the  naive  average  of  the  averages  would  have  |  the  variance 
of  the  weighted  estimator. 
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3.2  An  Interesting  Alternative  Measure 

The  maximum  variance  scenario  is  unlikely  to  occur  in  practice,  so  arguably  doesn’t  give  much  insight 
other  than  by  bounding  RI.  At  this  point  we  have  no  idea  of  whether  that’s  a  loose  bound  or  a  tight  one. 
Consider  the  following  alternative  measure.  Suppose  the  number  of  replications  is  even  and  that  half  of 
the  replicates  have  the  same  sample  size  niow  while  the  other  half  have  sample  size  «/„•.  The  total  sample 
size  is  then  N  =  (r/2)(n/ow  +  «/„•).  Substituting  this  into  equation  (8),  we  get 
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Note  that  this  measure  is  invariant  in  both  N  and  r,  hence  the  subscript  7.  If  we  describe  the  observed  sample 
sizes  as  a  set  {«i,«2,  ■  •  •  ,«r},  this  says  that  experiments  with  {1,4},  {25, 100},  and  {1, 1, 1, 1, 1,4, 4, 4, 4, 4} 
all  have  the  same  sample-size  ratio  «/„■  /niow  =  4,  and  therefore  all  have  the  same  relative  impact  RI  =  1 .5626. 
The  estimated  variance  of  the  mean  will  be  more  than  50%  larger  when  using  Y  rather  than  Yw. 
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Figure  1:  Invariant  relative  impact  vs.  the  ratio  of  maximum  sample  size  to  minimum  sample  size. 
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Figure  1  shows  the  variance  inflation  on  the  vertical  axis  as  a  function  of  the  sample-size  ratio  along 
the  horizontal  axis.  As  an  example,  a  5:1  sampling  ratio  inflates  variance  by  a  factor  of  1.8  when  the  naive 
estimator  is  used.  Note  that  the  relative  impact  is  non-linear  for  sample-size  ratios  below  2  but  becomes 
very  nearly  linear  for  higher  ratios. 

4  AN  EXAMPLE:  TURNING  LEMONS  INTO  LEMONADE 

Consider  a  system  that  processes  items  serially,  one  at  a  time.  Each  item  is  inspected  after  processing. 
If  the  item  fails  inspection,  it  is  reprocessed  and  inspected  once  again.  The  reprocessing/inspection  cycle 
continues  indefinitely  until  the  item  finally  passes  inspection.  Items  failing  inspection  arc  “lemons”,  and 
the  probability  that  a  lemon  passes  subsequent  inspections  after  any  reprocessing/inspection  cycle  is  far 
less  than  the  initial  probability  that  the  item  is  a  lemon. 

For  this  example,  the  total  processing-and-inspection  time  on  any  cycle  is  random  variable  P  distributed 
TRI( 25,30, 35)  minutes  with  E\P]  =  30.  The  probability  that  an  item  is  not  a  lemon  (and  passes  on  its  first 
inspection)  is  p i  =  0.95.  The  probability  that  a  lemon  passes  inspection  after  any  reprocessing/inspection 
cycle  is  p,  =  0.10.  The  simulation  terminates  after  at  least  480  minutes  have  expired  and  any  item  still  in 
process  passes  inspection.  That  is,  the  system  will  work  overtime  beyond  a  normal  8-hour  day  if  any  item 
is  still  in  process.  The  system  begins  with  a  new  item  immediately  available  and  ready  for  processing.  That 
is,  the  system  regenerates  after  the  terminating  condition  is  achieved  and  there  is  no  initialization  bias. 

The  objective  is  to  estimate  the  average  processing  time  for  all  items.  We  can  compute  this  average 
analytically.  If  the  item  is  a  lemon,  the  number  of  reprocessing/inspection  cycles  incurred  before  the  final 
cycle  which  results  in  the  item  passing  inspection  has  a  GEOM(pr )  distribution.  The  expected  processing 
time  is  therefore 

E[T]=E[P]  + 

=  30^0.95  +  0.05  ^^  (12) 

=  30(1.45)  =43.5. 

We  ran  100  simulation  replications  for  this  system  and  divided  the  results  into  5  sequential  experiments 
of  20  replications  each.  The  resulting  95%  confidence  intervals  for  each  20-replicate  experiment,  as  well  for 
the  single  combined  100-replicate  experiment,  arc  shown  in  Figure  2.  These  intervals  arc  computed  using 
both  the  traditional  approach  (displayed  as  lighter-colored  bars  to  the  left)  and  the  optimal  weighting  scheme 
(displayed  as  darker  bars  to  the  right).  The  true  mean  derived  in  equation  12  is  plotted  as  a  horizontal  red 
line.  Note  that  while  all  of  the  point  estimates  arc  greater  than  the  true  mean — an  artifact  of  the  terminating 
condition  which  we  have  not  corrected — all  but  one  of  the  intervals  cover.  The  exception  is  the  combined 
experiment  using  the  traditional  approach.  This  is  noteworthy  because  without  prior  knowledge  of  the 
correct  answer  most  analysts  would  probably  consider  the  combined  estimator  to  be  more  reliable  because 
of  its  larger  sample  size.  Note  also  that  the  weighted  estimators  provided  both  greater  accuracy  and  greater 
precision  in  all  cases. 

Figure  3  is  a  scatter  plot  of  the  number  of  observations  versus  the  estimated  mean  for  each  of  the 
100  replications.  Note  that  the  weights  applied  arc  proportional  to  the  number  of  observations  for  each 
replication.  The  power  trendline  illustrates  that  larger  numbers  of  observations  are  associated  with  smaller 
estimated  average  processing  times.  The  beneficial  effect  of  the  weighting  scheme  is  clear — larger  weights 
arc  given  to  the  more  common  outcomes  where  large  reprocessing  delays  arc  unusual,  since  lemons  represent 
only  5%  of  all  items  in  the  general  population. 

Note  that  this  example  was  not  difficult  to  construct.  It  is  representative  of  regenerative  processes 
subject  to  conditions  that  tend  to  correlate  the  magnitude  of  an  aggregate  output  measure  with  the  number 
of  observations  obtained  during  any  regenerative  cycle.  In  this  example,  the  events  are  “disruptive”  and 
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Figure  2:  95%  confidence  intervals  on  the  mean  processing  time  computed  using  both  the  traditional 
approach  (lighter  bars  to  the  left)  and  the  recommended  weighting  scheme  (darker  bars  to  the  right)  for 
the  20-replication  experiments  and  the  combined  100-replication  experiment. 


20 


Figure  3:  Scatter  plot  illustrating  negative  correlation  between  the  number  of  items  completed  and  the 
estimated  mean  processing  time  for  each  of  100  replications. 
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the  correlation  is  negative,  as  might  be  the  case  in  systems  with  random  failures  or  vacations.  One  can 
also  think  of  events  that  are  “benign”  and  induce  a  positive  correlation,  such  as  unusually  long  stretches 
of  good  weather  and/or  instances  of  unusually  low  absenteeism  on  crews  for  construction  projects. 

5  CONCLUSIONS 

Using  the  naive  F  estimator  rather  than  Yw  will  often  make  little  or  no  difference.  This  is  the  case  when 
the  terminating  simulation  produces  1)  single-observation  measures  as  its  output;  2)  aggregate  measures 
where  the  terminating  rule  guarantees  identical  sample  sizes;  or  3)  the  sample  sizes  can  vary  from  run  to 
run  but  are  strongly  consistent,  i.e.,  the  ratio  of  max  to  min  sample  size  is  relatively  close  to  one.  In  cases 
1)  and  2),  the  naive  estimator  and  the  weighted  estimator  are  mathematically  identical.  In  case  3)  the  two 
estimators  should  produce  results  that  are  very  close  to  each  other,  but  to  the  extent  that  the  estimates  differ 
the  weighted  estimator  is  the  better  one.  When  none  of  the  three  cases  applies,  the  weighted  estimator  can 
significantly  outperform  the  naive  estimator.  Given  that  the  weighted  estimator  is  simple  to  compute,  we 
recommend  that  it  should  always  be  used. 
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